12/01/2020

Agenda

  • Goals
  • Simple Neural Networks
  • Deep Neural Networks
  • Towards real data applications

Goals

Introduction

  • Neural Networks (NN) are flexible models for supervised learning:
    • Regression
    • Classification
  • NN can approximate any continuous function (on a compact domain) arbitrarily well
  • Historically, NN were inspired by modeling biological neural networks

Introduction

  • The human nervous system has roughly 86 billion neurons, connected by approximately \(10^{14}\) - \(10^{15}\) synapses
  • A neuron receives input signals through its dendrites and produces output signals along its axon
  • The axon eventually branches out and connects via synapses to the dendrites of other neurons

How it works

  • \(x_0,x_1,x_2\): input signals
  • \(w_0, w_1, w_2\): weights or synaptic strengths
  • If the weighted sum of the input signals is above a certain threshold, the neuron fires and sends a spike along its axon
  • \(f(\cdot)\): activation function, which outputs the firing frequency of the neuron (see the sketch below)
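A minimal sketch of this computation in NumPy (the bias term b and the choice of a sigmoid for \(f\) are illustrative assumptions):

```python
import numpy as np

def neuron_output(x, w, b):
    """Weighted sum of the input signals plus bias, passed through a sigmoid activation."""
    z = np.dot(w, x) + b             # weighted sum of the input signals
    return 1.0 / (1.0 + np.exp(-z))  # "firing rate" between 0 and 1

# Example: three input signals x0, x1, x2 with weights w0, w1, w2
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.1, 0.4, -0.2])
print(neuron_output(x, w, b=0.0))
```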

Simple Neural Networks


Activation functions

  • Sigmoid: it maps real-valued input to a range between 0 and 1
  • TanH: it maps real-valued input to a range between -1 and 1
  • ReLU (Rectified Linear Unit): it takes a real-valued input and thresholds it at zero (replaces negative values with zero)
  • We only consider nonlinear activation functions, because linear activations degenerate: a composition of linear transformations is still a linear transformation (the three activations are sketched below)
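For reference, a small NumPy sketch of the three activations (the input vector is made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # maps to (0, 1)

def tanh(z):
    return np.tanh(z)                 # maps to (-1, 1)

def relu(z):
    return np.maximum(0.0, z)         # replaces negative values with zero

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(z), tanh(z), relu(z))
```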

NN training

  • Regression: find weights (parameters) which minimize the squared error loss \[\widehat{\mathbf{w}} = \text{argmin}_{\mathbf{w}} \dfrac{1}{2} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2\] where \(\hat{y}_i\) is the output from the NN with input \(x_i\) and weights \(\mathbf{w}\)
  • Binary classification: find weights which minimize the cross-entropy loss \[\widehat{\mathbf{w}} = \text{argmin}_{\mathbf{w}} \sum_{i=1}^{n} \left\{ - y_i \log (\hat{y}_i) - (1 - y_i) \log (1 - \hat{y}_i) \right\}\] where \(0 \leq \hat{y}_i \leq 1\) is the output from the NN with a sigmoid activation function in the last layer
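The two loss functions above, written out in NumPy (the small eps clipping is a numerical safeguard, not part of the definition):

```python
import numpy as np

def squared_error_loss(y, y_hat):
    # Regression: one half of the sum of squared residuals
    return 0.5 * np.sum((y - y_hat) ** 2)

def cross_entropy_loss(y, y_hat, eps=1e-12):
    # Binary classification: y in {0, 1}, y_hat in (0, 1)
    y_hat = np.clip(y_hat, eps, 1 - eps)  # avoid log(0)
    return np.sum(-y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat))
```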

NN training

How to minimize these loss functions? We use gradient descent (via back-propagation) to find \(\widehat{\mathbf{w}}\)!

  • Forward-propagation: given inputs and weights, the outputs are determined by following the network
  • Backward-propagation: training the NN (i.e. estimating the weights) works in reverse, propagating derivatives of the error from the outputs back to the inputs (a toy sketch follows below)
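To make the idea concrete, here is a toy (full-batch) gradient descent sketch for a one-parameter regression; the data and learning rate are made up for illustration:

```python
import numpy as np

# Loss: L(w) = 1/2 * sum_i (y_i - w * x_i)^2, with gradient dL/dw = -sum_i (y_i - w * x_i) * x_i
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.1, 2.1, 3.9, 6.2])

w, lr = 0.0, 0.01
for step in range(200):
    y_hat = w * x                      # forward pass
    grad = -np.sum((y - y_hat) * x)    # backward pass: derivative of the loss w.r.t. w
    w -= lr * grad                     # gradient descent update
print(w)                               # converges to the least-squares slope
```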

For those who are interested: stochastic gradient descent, mini-batches, the Adam algorithm

Coding complex neural networks from scratch can be challenging; thankfully, there are existing frameworks that do it for us. Check out https://www.tensorflow.org/
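For example, a minimal Keras sketch of a feed-forward NN for binary classification (the layer sizes and the assumption of 10 input features are arbitrary choices for illustration):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),                     # 10 input features (assumed)
    tf.keras.layers.Dense(16, activation="relu"),    # hidden layer 1
    tf.keras.layers.Dense(16, activation="relu"),    # hidden layer 2
    tf.keras.layers.Dense(1, activation="sigmoid"),  # sigmoid output for cross-entropy
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(X_train, y_train, epochs=20, batch_size=32)  # with your own data
```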

Gradient descent

Deep Neural Networks

Deeper Neural Networks

Number of layers and hidden units

  • More layers and hidden units increase the flexibility of the NN, but also make it more prone to overfitting
  • Rule of thumb: choose 2 or 3 hidden layers with a moderate to large number of hidden units, and use regularization (e.g. \(\ell_2\) regularization, dropout) to prevent overfitting

\(\ell_2\) regularization

  • Add \(\frac{1}{2} \lambda w^2\) to the loss/objective function for each weight \(w\), where \(\lambda\) is a tuning parameter that controls the strength of the regularization
  • In the example above, each neural network has 20 hidden units; a higher regularization strength makes the final decision regions smoother
  • In practice, \(\lambda\) is chosen using cross-validation
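A sketch of how the penalty enters the objective (the weights argument is a hypothetical list of weight arrays, one per layer):

```python
import numpy as np

def l2_penalty(weights, lam):
    # Add 0.5 * lambda * w^2 for every weight w in the network
    return 0.5 * lam * sum(np.sum(W ** 2) for W in weights)

# total objective = data loss + l2_penalty(weights, lam)
```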

Dropout

  • Dropout randomly sets a fraction of the hidden units to zero at each training update, which discourages co-adaptation of units and reduces overfitting
  • In practice, the dropout rate is often chosen to be \(50\%\)
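In Keras, dropout is added as its own layer; a sketch with a 50% rate (layer sizes are again arbitrary):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dropout(0.5),    # randomly zero 50% of the hidden units during training
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
```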

Practical Issues

  • Data preprocessing
    • Normalization: standardization, or scaling each variable to \((-1, 1)\)
    • Whitening: principal component analysis
  • Weight initialization
    • Do not set all the initial weights to zero
    • Instead, for a ReLU neuron with \(m\) inputs, draw \(w_i \sim \mathcal{N}(0, 2/m)\) (He initialization; see the sketch below)
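A sketch of this initialization in NumPy:

```python
import numpy as np

def he_init(m, n_units, seed=0):
    # Each weight is drawn from N(0, 2/m), i.e. standard deviation sqrt(2/m),
    # where m is the number of inputs feeding the ReLU neuron.
    rng = np.random.default_rng(seed)
    return rng.normal(loc=0.0, scale=np.sqrt(2.0 / m), size=(n_units, m))

W = he_init(m=100, n_units=20)   # weights for 20 hidden units, each with 100 inputs
```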

Towards real data applications

Convolutional Neural Networks


Why not use regular NN?

  • A 2D color image is a 3-dimensional array (width, height, depth) of pixel values, where the depth corresponds to the 3 color channels
  • A greyscale image is a matrix (width, height) of pixel values
  • Even a small \(200 \times 200\) color image leads to \(200 \times 200 \times 3 = 120{,}000\) weights for each hidden unit
  • We may need hundreds of hidden units
  • The full connectivity of a regular NN is wasteful and the huge number of parameters would quickly lead to overfitting

Convolutional Neural Networks (CNN)

  • A Convolutional Neural Network (CNN or ConvNet) is a NN specifically designed for image inputs
  • Very popular in computer vision and image analysis
  • The most common task is to classify images

1. Convolution


We slide the orange matrix (the filter) over the original image (green) one pixel at a time (this step size is called the stride); at every position we compute the element-wise product with the covered patch and sum the results to obtain the corresponding element of the output matrix (pink), called the feature map.
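A direct NumPy implementation of this sliding operation (the example image and filter values are made up for illustration):

```python
import numpy as np

def convolve2d(image, kernel, stride=1):
    # Slide the kernel over the image; at each position take the
    # element-wise product with the covered patch and sum it.
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

image = np.array([[1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]])
kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])
print(convolve2d(image, kernel))   # a 3x3 feature map
```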

2. ReLU

Apply ReLU element-wise to the feature map to introduce non-linearity.
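In code this is a one-liner (the feature map values here are hypothetical):

```python
import numpy as np

feature_map = np.array([[4.0, -2.0],
                        [-1.0, 3.0]])    # hypothetical convolution output
print(np.maximum(0.0, feature_map))      # negative values replaced by zero
```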

3. Pooling

Max pooling progressively reduces the spatial size of each feature map while keeping the most important information. It reduces the number of parameters and the amount of computation in the network, and hence also helps control overfitting.
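A sketch of \(2 \times 2\) max pooling with stride 2 on a small, made-up feature map:

```python
import numpy as np

def max_pool(feature_map, size=2, stride=2):
    # Keep only the largest value in each (size x size) window.
    h = (feature_map.shape[0] - size) // stride + 1
    w = (feature_map.shape[1] - size) // stride + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = feature_map[i * stride:i * stride + size,
                                    j * stride:j * stride + size].max()
    return out

fm = np.array([[1.0, 3.0, 2.0, 1.0],
               [4.0, 6.0, 5.0, 0.0],
               [2.0, 1.0, 9.0, 8.0],
               [0.0, 3.0, 4.0, 7.0]])
print(max_pool(fm))   # [[6. 5.] [3. 9.]]
```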

4. Fully Connected Layer

The outputs from the convolutional and pooling layers represent high-level features of the input image. The purpose of the fully connected layer is to use these features to classify the input image into the various classes present in the training dataset.
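Putting the four pieces together, a minimal Keras CNN in the convolution, ReLU, pooling, fully connected pattern (the \(32 \times 32\) RGB input size, the number of filters, and the 10 output classes are all illustrative assumptions):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(32, 32, 3)),                       # small RGB images (assumed)
    tf.keras.layers.Conv2D(16, (3, 3), activation="relu"),   # convolution + ReLU
    tf.keras.layers.MaxPooling2D((2, 2)),                    # max pooling
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu"),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),             # fully connected layer
    tf.keras.layers.Dense(10, activation="softmax"),          # class probabilities
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```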

Summary

Conclusions

  • DNNs are very popular these days. They seem to work best on highly non-linear but low-noise problems (think of images); it is unclear how successful they are in high-noise social science/economics applications
  • Machine Learning vs Statistical Learning
  • Quantification of uncertainty in Neural Networks is an active area of research

e-CIS

Please fill out the anonymous electronic course evaluation. Feel free to leave your feedback, the course can always improve thanks to students’ input!

Question time